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EFFICIENCY AND COMPUTABILITY OF MCMC WITH 
LANGEVIN, HAMILTONIAN, AND OTHER 
MATRIX-SPLITTING PROPOSALS 

By Richard A. Norton and Colin Fox 
University of Otago 

We analyse computational efficiency of Metropolis-Hastings algo¬ 
rithms with AR(1) process proposals. These proposals include, as a 
subclass, discretized Langevin diffusion (e.g. MALA) and discretized 
Hamiltonian dynamics (e.g. HMC). 

By including the effect of Metropolis-Hastings we extend earlier 
work by Fox and Parker, who used matrix splitting techniques to 
analyse the performance and improve efficiency of AR(1) processes 
for targeting Gaussian distributions. 

Our research enables analysis of MCMC methods that draw sam¬ 
ples from non-Gaussian target distributions by using AR(1) process 
proposals in Metropolis-Hastings algorithms, by analysing the matrix 
splitting of the precision matrix for a local Gaussian approximation 
of the non-Gaussian target. 


1. Introduction. Many existing Metropolis-Hastings (MH) algorithms 
for sampling from a non-Gaussian target distribution it use an AR(1) process 
proposal; given current state x € the proposal y € M. d is given by 


(1) y = Gx + g + v 

where G G M. dxd is the iteration matrix, g G is a fixed vector and v is an 
i.i.d. draw from IV(0, E). In general, G , g and E may depend on x. The MH 
acceptance probability is 


a(x,y) = 1 A 


?! (y)q(y,x) 

ir(x)q(x,y) 


where tt(x) denotes the target probability density function and q(x, d y) = 
q(x , y)dy is the transition kernel for the proposal y given current state x. 

An important case of (1) is when the AR(1) process comes from a matrix 
splitting of some matrix A associated with it at x, for example when A is the 
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local Hessian of log7r, or its inverse, or an approximation to these matrices. 
In this case we find it more convenient and revealing to write (1) in terms 
of the matrix splitting A = M — N so that y is given by solving 

(2) My = Nx + /3 + v 

where /3 is a vector and v is an i.i.d. draw from IV(0, M T + N), G = M~ 1 N, 
g = M -1 /3 and H = M~ 1 (M T + N)M~ T . In fact, if the spectral radius of 
G is less than 1 then the AR(1) processes (1) can be written in terms of 
a matrix splitting and vice versa, see Section 2. Moreover, if the spectral 
radius of G is less than 1, then the proposal chain generated by (2) will 
converge to N{A~ l ^,A~ 1 ), which we call the proposal target distribution , 
see [12]. 

Later, we see that the Metropolis-adjusted Langevin algorithm (MALA) 
is an example of an AR(1) process proposal with a corresponding matrix 
splitting, as are other discretized Langevin diffusion proposals, as are also the 
Hybrid Monte Carlo algorithm (HMC) and other discretized Hamiltonian 
dynamics proposals. We will use this observation to further analyse these 
methods and discuss their computability in Section 5. 

Although identifying (1) with (2) is useful for analysing existing methods, 
if the task is designing a proposal for the MH algorithm then using (2) is 
more natural because we can begin by choosing A and /3. For example, we are 
particularly interested in the proposal where A and ft are chosen such that 
N(A~ 1 /3,A~ 1 ) is a local Gaussian approximation to 7r, i.e. — \x T Ax + /3 T x 
is a local quadratic approximation to log7r. 

After selecting A and /3, we must also choose the splitting M and N. This 
choice will effect the computational cost of each proposal, as well as how fast 
the AR(1) process will converge to the proposal target distribution. 

We may use one or more iterations of the AR(1) process (2) as a pro¬ 
posal for the MH algorithm. For the next proposal (if the previous one was 
accepted) we then use a new local Gaussian approximation to 7r. This idea 
mimics the design of some optimizers where a local quadratic approximation 
to the objective function is minimized at each iteration, see for example [18]. 

Our concern is the analysis of Markov chain Monte Carlo (MCMC) meth¬ 
ods with AR(1) process proposals which fall into three cases. 

1. The target and proposal target distributions are the same, i.e. 7r = 
N(A^ 1 /3,A^ 1 ) and the AR(1) process targets 7r directly. Then (1) is 
a fixed-scan Gibbs sampler for n, and the accept/reject step in the 
MH algorithm is redundant. Fox and Parker [11, 12, 13] have already 
studied this case. They showed how to choose M and N (equivalently 
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choose G and X) to obtain an efficient algorithm and they also showed 
how to accelerate the AR(1) process to achieve even greater efficiency. 

2. The target distribution it is normal, N(A~ 1 b, A -1 ) for some matrix 
A and vector b, but not targeted by the AR(1) process, i.e. A / A 
and/or b / /?. Then a MH accept/reject step is required to ensure it 
is targeted. 

3. The target distribution it is not normal, hence the AR(1) process does 
not target it, and a MH accept/reject step is required to ensure it is 
targeted. The proposal target is a local Gaussian approximation to ir. 

Cases 2 and 3 where the target and proposal target distributions are 
different are harder to analyse than case 1 as the MH accept/reject step 
affects the transition kernel. This begs the following questions in these cases: 

• How should we choose A, /?, M and N to construct a computationally 
efficient algorithm? 

• Can we accelerate MH algorithms with AR(1) process proposals for 
non-Gaussian target distributions? 

We measure the quality of a sampling algorithm by its computational effi¬ 
ciency, which is a combination of the compute time required for the Markov 
chain to reach equilibrium (burn in), and once in equilibrium, the compute 
time required to produce quasi-independent samples. Simply, we are inter¬ 
ested in getting the most ‘bang for buck’, where ‘bang’ is the number of 
quasi-independent samples and ‘buck’ is the computing cost. 

Fox and Parker [12] mainly considered distributional convergence from 
an arbitrary state, so they measured burn in. Here we are concerned about 
the compute time to produce quasi-independent samples. The integrated 
autocorrelation time (length of Markov chain with variance reducing power 
of one independent sample) is a proxy for independence, hence we try to 
measure compute time per integrated autocorrelation time. 

In this article we focus on analysing case 2 above where ir is the nor¬ 
mal N(A~ 1 b , A -1 ), but not the same as the proposal target N(A~ l f3, A” 1 ). 
However, our results are relevant to case 3 where tt is not normal because 
the cases share important features. In particular, the accept/reject step in 
the MH algorithm is used in both cases to correct the difference between the 
target and proposal target distributions. The simplification of only consider¬ 
ing case 2 makes the analysis of the transition kernel possible, whilst keeping 
the essential feature of the non-Gaussian target case, that the spectrum of 
the transition kernel has been changed by the MH algorithm. If we cannot 
analyse or accelerate case 2 then we do not expect to be able to analyse or 
accelerate case 3. 
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Moreover, in case 3 we might construct an inhomogeneous Markov chain 
by updating the proposal rule at each iteration (or every few iterations) by 
updating the local Gaussian approximation of the target distribution, hence 
we are interested in the local behaviour of the algorithm where the target 
distribution is approximately normal. 

Our analysis of case 2 is for the special case when the matrices M and 
N are functions of A. This allows us to simultaneously diagonalise both the 
AR(1) process and the target distribution with a change of variables. This 
is not an overly restrictive condition if factorizing the matrix A is infeasible, 
and we will see below that it includes several important examples of MH 
algorithms already in use. 

Our analysis is also analogous to the analysis of optimizers where it is 
useful to test the behaviour of optimizers on quadratic cost functions (in 
the probability sense, a Gaussian is equivalent to a quadratic function since 
log 7r is quadratic when ir is Gaussian). 

Acceleration techniques for AR(1) processes for sampling in case 1 (nor¬ 
mal target) do not necessarily accelerate sampling in case 3 (non-normal 
target) when the accelerated AR(1) process is used as a proposal in the 
MH algorithm. Goodman and Sokal [14] accelerated Gibbs sampling of nor¬ 
mal distributions using ideas from multigrid linear solvers, but only ob¬ 
served modest efficiency gains in their non-normal examples (exponential 
distributions with fourth moments). Green and Han [15] applied successive- 
over-relaxation to a local Gaussian approximation of a non-Gaussian target 
distribution as a proposal for the MH algorithm. Again, they did not observe 
significant acceleration in the non-normal target case. 

By recognising that MH algorithms with discretized Langevin diffusion 
or discretized Hamiltonian dynamics proposals are examples of MH algo¬ 
rithms with AR(1) process proposals, we are able to apply our theory to 
these MCMC methods. We find that we can replicate existing results in the 
literature for these methods, and we can also extend existing results in some 
cases, see Section 5. We also discuss the computational efficiency of these 
methods. 

Strictly, our analysis is exact for Gaussian target distributions, but it can 
also be applied to the case when 7r is absolutely continuous with respect to 
a Gaussian (see e.g. [24], [5] or [6]). 

Sampling from a Gaussian (or a target that is absolutely continuous with 
respect to a Gaussian) is also a problem that arises in inverse problems where 
the forward model has the form y = Fx + n where y is the observed data, F 
is a linear (or bounded for absolutely continuous with respect to a Gaussian) 
operator, x is the unknown image with Gaussian prior distribution, and n is 
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Gaussian noise, see e.g. [24, Example 6.23]. If the posterior is Gaussian with 
hyperparameters satisfying some other marginal distribution, and we can 
sample the hyperparameters independently, then conditionally sampling the 
posterior given the hyperparameters is also a Gaussian sampling problem 
[2]. Other applications that involve sampling Gaussian distributions include 
Brownian bridges [6]. 

For the purpose of evaluating the cost of computing a proposal from (2) 
for a given matrix splitting, we make similar assumptions as made in numer¬ 
ical linear algebra for solving Ax = b, since solver and sampling algorithms 
share many of the same operations. We assume that we can efficiently com¬ 
pute matrix-vector products Av for any vector v € M d ; for example A may 
be sparse. Furthermore, we assume that d is sufficiently large so that it is 
infeasible to directly compute A 1 / 2 or any other matrix factorization of A 
and then directly compute independent samples from our Gaussian target. 

The remaining sections are as follows. In Section 2 we quickly show that 
(1) and (2) are equivalent, then Sections 3 and 4 present new analyses for 
the expected acceptance rate and jump size of MH algorithms with AR(1) 
process proposals. Section 5 then applies this new analysis to proposals from 
Langevin diffusion and Hamiltonian dynamics. We see that these proposals 
are AR(1) processes and we identify the corresponding matrix splitting and 
proposal target distribution. Using our earlier analysis we assess the con¬ 
vergence properties of these methods as d —>• oo. We provide concluding 
remarks in Section 6. 

2. AR(1) processes correspond to matrix splittings. We can ex¬ 
press an AR(1) process using either (1) or (2), provided the AR(1) process 
converges. 

Theorem 2.1. If we are given G, g and £, and the spectral radius of G 
is less than 1. then the AR(1) process (1) can be written as (2) using 

( OO 

1=0 

and 

M = A(I - G) -1 , 

(4) v ’ ’ 

N = A(I-G)~ 1 G, /3 = A(I — G)~ 1 g. 

Note that we also have A = M — N symmetric and positive definite. 
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Proof. Since the spectral radius of G is less than 1 and £ is symmet¬ 
ric positive definite, it follows that A^ 1 := YliZoG l '^(G T ) 1 is well-defined, 
symmetric and positive definite. 

It is then easy to see that (3) and (4) satisfy A = M — N, G = M~ l N and 
g = We must also check that £ = M~ 1 {M T + N)M~ T is satisfied. 

Substituting (4) into M~ 1 (M T + N)M~ T we get 

M~ 1 (M t + N)M~ t = (/ - G)A~ 1 + GA~ 1 G t 
= A~ 1 -GA~ 1 G t 

oo 

= G l nG T ) 1 ~ G' / £(G T ) i 

1=0 1=1 
= £. 

□ 

In the following special case we obtain a symmetric matrix splitting. 

Corollary 2.2. If the spectral radius of G is less than 1 and GT, is 
symmetric, then the AR(1) process (1) has a corresponding matrix splitting 
defined by 

M = £- 1 (I + G), A = M(I - G) = E-^I - G 2 ), 

N = MG = ^-\l + G)G, [3 = Mg = £" 1 (I + G)g, 

and M and N are symmetric (we say the matrix splitting is symmetric). 

Proof. These matrix splitting formulae follow from the identity 

OO OO 

Y G 1 T,(G t ) 1 = Y G 21 ^ = (/ - G 2 ) -1 £. 

1=0 1=0 

To see that M is symmetric (and hence also N since A is symmetric and 
A = M — N) we note that 


M = 


(. i-g)Y g21 z 


1=0 



□ 
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3. Expected acceptance rate for a general matrix splitting. The 

expected acceptance rate is a quantity that is related to efficiency and for 
optimal performance the proposal is usually tuned so that the observed 
average acceptance rate is well away from 0 and 1. For example, it has been 
shown in the case when d oo that 0.234 is optimal for the random-walk 
Metropolis algorithm (RWM) [20], 0.574 is optimal for MALA [21], and 
0.651 is optimal for HMC [3]. All of these results required expressions for 
the expected acceptance rate of the algorithm as d —>• oo. Here we derive an 
expression for the expected acceptance rate for an AR(1) process proposal 
(2) with Gaussian target N(A~ l b, A~ x ), provided the splitting matrices are 
functions of A. Thus, our MH algorithm is defined by 

^ Target: N(A~ 1 b, AT 1 ), 

Proposal: y = Gx + M _1 /3 + (M" 1 - GA~ 1 G T ) 1 ^ 2 f t , 

where £ ~ AT(0, /), and we have used G = M~ l N and M _1 (M T +1V)M~ 2 = 
M” 1 — GA~ 1 G T [13, Lem. 2.3]. The following lemma is a result of simple 
algebra. The proof is in the Appendix. 

Lemma 3.1. Suppose A = M — N is a symmetric splitting. Then the 
MH acceptance probability for (5) satisfies 

a(x, y) = 1 A exp (^~y T (A - A)y + ^x T (A - A)x + (b - fi) T {y - x) 

Now define a spectral decomposition 

A = QAQ t 

where Q G M. dxd is orthogonal and A = diag(Af,..., A^) is a diagonal matrix 
of eigenvalues of A. Although we may not be able to compute Q and A this 
does not stop us from using the spectral decomposition for theory. Simple 
algebra gives us the following result. 

Lemma 3.2. Suppose M = M(A) and N = N(A) are functions of A. 
Then G and A are also functions of A and under the coordinate transfor¬ 
mation 

x FA Q T x 

the MH algorithm (5) is transformed to a MH algorithm defined by 
Target: N(A~ 1 Q T b 1 A -1 ), 

Proposal: y = Gx + M(A)' 1 Q T /? + (A" 1 - GA _1 G T ) 1 / 2 ^, 



(6) 
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where £ ~ N(0,I), and G = M(A) 1 N(A) and A = Xl(A) are diagonal 
matrices. 


Using Lemma 3.1 we see that the acceptance probability of MH algorithms 
(5) and (6) are identical and hence it is sufficient to analyse the convergence 
properties of (6) to determine the convergence properties of (5). 

We will need the following Lyapunov version of the Central Limit Theo¬ 
rem, see e.g. [7, Thm. 27.3] 


Theorem 3.3. Let X\,... ,Xd be a sequence of independent random 
variables with finite expected value Hi and variance af. Define s 2 d := Ya=i • 
If there exists a 5 > 0 such that 


lim 

d—>oo 



d 

^E[\x i -m\ 2+s ] = o, 

2=1 


then 

1 d 

77 — fj,i) —> iV(0,1) as d —>• oo. 

Sd t=l 

An equivalent conclusion to this theorem is Yli=i Xi i IHi s d) i n 

distribution as d —>• oo. Another useful fact is 

(7) X~N{n,a 2 ) => E[1 Ae x ] = $(L) + e Ai+<T2/2 $(-(j- 

where $ is the standard normal cumulative distribution function. See e.g. 
[20, Prop. 2.4] or [5, Lem. B.2], 

Theorem 3.4. Suppose that M and N in (5) are functions of A, and 
the Markov chain is in equilibrium, i.e. x N(A l b,A 1 ). Recall that A 2 
are eigenvalues of A, and define A 2 and Gi to be the eigenvalues of A and 
G respectively. Also define 

mi := ( A~ 1 b ) i , m := 

'h := mi - rhi 


9i ■ = 


1 - Gi , 

A 2 - A 2 


n := 


A? 


9i ■ = 


1 -Gf, 

A 2 - A 2 


r, := 


A? 
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and 


T 0i := -jji), 

I— Ti\i(viQi 9i)i 

T 2 i ■■= ri\igl /2 (l + fi) 1/2 ( 1 - riGi), 

fiii •= 2 

r 4i := -5^(1 + fi), 

T 5i := -r i G i 5 l 1/2 (l + fj) 1/2 . 


If there exists a 5 > 0 such that 

E d |t-\.| 2 + 

yu; *=1 


lim 

d—> 00 


IT- • I 2 
Z^z=l l 1 31 1 


1+5/2 


= 0 


forj = 1,2,3,4, 5 


(j = 0 is noi required), then 

'n(y)q{y,x)\ x> 


Z := log 


+)+,y) 


iV(/z, a 2 ) as d ^ 00 


where n = lim^oo X^=l hi and a 2 = lim^^ X^=i a 2 and 

hi = T 0i + T 32 + T 4 j and of = Tj 2 + Tf + 2Tjf + 2T 4i + T 2 j. 
Hence, the expected acceptance rate satisfies 

E[a(x, y)\ —> <L(^) + e /i+<T / 2 < b(— a — £) as d —>• 00 
where <h is i/ie standard normal cumulative distribution function. 


Proof. By Lemma 3.2 it is sufficient to only consider (5) in the case 
where all matrices are diagonal matrices, e.g. A = diag(A 2 ,..., A^), A = 
diag(Af,..., A 2 d ),M = diag(Mi,..., M d ), G = diag(Gi,..., G d ), m* =Xf\, 


and rhi = A ?: 2 fii. Then, in equilibrium we have 

Xi = mi + —f t 
Ai 

where G ~ JV(0,1) and using rhi = Gifhi + Mfi 1 fii we have 


Vi = G iXi + A + (1 ^ )V2 


Mi 


A i 


Vi 


1/2 


— Gj ( mi + —+ (1 — Gi)rhi H—T—n,; 


G,; 


1/2 


— ffij + Gif + — H—=— 
'Y A,; 
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where ui ~ 1V(0,1). From Lemma 3.1 we also have Z = Yli=i where 

Zi = -±(A 2 - A 2 )( y 2 - x?) + ( 6i - / 8j )( i/i - Xj ). 

Substituting Xj and j/j as above, using identities such as A 2 A “ 2 = 1 + f* and 
( b{ — (3i)\~ 2 = f{ + '/ym,;, then after some algebra we eventually find 

Zi = Tot + + Tzi£? + 


Hence 

Mi := E[Zj] = Toi + T 3 j + ?4j 

and 

+ := Var[Zj] = E[Z 2 ] - E[Z+ 

= (T 0 2 + T 2 j + T 2 2 + 3T S 2 + 3 T 4 2 + T 5 2 + 2T 0 , : T 3i + 2T 0 i T 4i + 2T 3 i T 4l ) 

~ {Toi + TJjj + T 4 j) 

= ES + T 2 + 2 T 3 2 + 2T 2 + r 5 2 


Zi — Mi — + T 3 j(£ 2 — 1) + T 4 j(t' 2 — 1) + T^^ry. 


Therefore, for any d € N and <5 > 0 we can bound the Lyapunov condition 
in Theorem 3.3 as follows 


1 “ c2+(5 ,J 

« E E »«i - »i 2+i i <i«E c >wE Rii 


|2+<5 


2+<5 

b d i=l 


< 5 


d j =1 i=l 

5 I rp 

2+<5 \ ' (%\ Z_/i=l UMl 


Ed® 

( el n 


1+5/2 


where Ci(<5) = C 2 {5) = E[|++ , C 3 (5) = C 4 (S) = E [|£ 2 - 1| 2 + and 
C 5 (5)=E[|+++,and£~lV(0,l). 

Therefore, if ( 8 ) holds then the result follows from Theorem 3.3 and (7). 

□ 


4. Expected squared jump size for a general matrix splitting. 

The efficiency of a MCMC method is usually given in terms of the integrated 
autocorrelation time which may be thought of as “the number of dependent 
sample points from the Markov chain needed to give the variance reducing 
power of one independent point”, see e.g. [17, §6.3]. Unfortunately, we are 
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unable to directly estimate this quantity for our matrix splitting methods, 
and it depends on the statistic of concern. As a proxy we instead consider 
the expected squared jump size of the Markov chain in a direction q G M d , 


E[( q T (x'-x)) 2 } 


where x,x' ~ N(A~ 1 b, A -1 ) are successive elements of the Markov chain in 
equilibrium. We will only consider the cases where q is an eigenvector of the 
precision or covariance matrix. It is related to the integrated autocorrelation 
time for the linear functional q T {-) by 


Corr[g T x, q T x'] = 1 — 


E [(q T {x' - x)) 2 ] 
2 Var[g T x] 


so that large squared jump size implies small first-order autocorrelation. 

This is similar to the approach used for analysing the efficiency of RWM, 
MALA and HMC, where the expected squared jump size of an arbitrary 
component of the Markov chain is considered (see e.g. [5] and [3]) . 

We will need the following technical lemma whose proof is in the Ap¬ 
pendix. 


Lemma 4.1. Suppose that is a sequence of real-valued numbers 

and r > 0. Then, for any k € N, 


(9) 



=> 



= 0 . 


The following main theorem for this section is a generalization of [3, Prop. 
3.8], 


Theorem 4.2. Suppose that M and N in (5) are functions of A, and 
the Markov chain is in equilibrium, i.e. x N(A x b,A - 1 ). With the same 
definitions as in Theorem 3.f, let qi be a normalized eigenvector of A cor¬ 
responding to A? (i th column of Q). If there exists a 5 > 0 such that (8) is 
satisfied, then 


( 10 ) 


V[(qJ(x'-x)) 2 ]^lhU 2 + lh 


as d —>• oo 
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where 


Ui — g i r i + -r^ + ~ 2 , 

U 2 = $((£■) + e^" +(,T-)2/2 $(-^ - fr), 

M < K 2 + M?) 1/2 X Uff + 3 J + 3 | + 6^2 + 6^ + 6 


~a a_\ 


1/2 


and, where y := and ( a f ■= E/= 1 ,/^ 






Proof. Under the coordinate transformation x -H- Q T x, (5) becomes 
(6) and E [(qf (x 1 — x)) 2 ] becomes E[(a^ — Xi ) 2 ]. Therefore it is sufficient 
to only consider the squared jump size of an arbitrary coordinate of the 
Markov chain for the case when all matrices are diagonal matrices. As in the 
proof of Theorem 3.4, let A = diag(A 2 ,..., A 2 ), A = diag(A 2 ,..., A 2 ,), M = 
diag(Mi,... ,M d ), G = diag(Gi,..., G d ), rm = X~ 2 b t , and m = A“ 2 /3;. 
Since the chain is in equilibrium we have Xi = m-j + A” 1 ^ for ~ iV(0,1) 
and yi = fhi + Gif + GjA” 1 ^ + g)/ 2 XA ^ where ~ A r (0,1). Now let I d be 
the indicator function such that 


jd , = 1 1 if u < a ( x i v) 
1 0 otherwise 


where u ~ U[ 0,1]. Then x' = I d (y — x) + (l — I d )x and {x' — x) 
For given coordinate i, define another indicator function I d ~ 


2 = I d (y — x) 2 . 
such that 


I d ~ : = 



if u < a (x, y) 
otherwise 


where a (x,y) := 1 A exp(^y =1 Zj). and define e,j := I d (y — x) 2 . The 
proof strategy is now to approximate E[(x[ — Xi ) 2 ] with E[e^]; 

E[(x' - Xi) 2 ] = E [ ei ] + E [(x'i - Xi ) 2 - e*]. 


First, consider E[e*]; 

E[ ei ] =E[I d -( yi -Xi) 2 ] 

= E [{yi - Xi) 2 ]E[a~(x,y )] 


= E 

= UiE[a~(x,y)\ 


9i j- g 1/2 
-n - —& + -=-14 
\ A i 


E[a (x,y)\ 
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Also, by Theorem 3.4 (using Lemma 4.1 to ensure the appropriate condition 
for Theorem 3.4 is met) we obtain E [a~(x,y)\ —>• U 2 as d —>• 00 . 

The error is bounded using the Cauchy-Schwarz inequality; 

|E[(z' - x,) 2 - e,]| = |E[(/ rf - I d ~)( yi - Xl ) 2 )\ 

< E[(a(x,y) - a~(x,y)) 2 ] 1/2 E[( yi - x;) 4 ] 1/2 . 

Since 1 A e A is Lipschitz with constant 1, and using results from the proof 
of Theorem 3.4, we obtain 

E[(a(x,y) - a-^x^y)) 2 } 1 ' 2 < E [Z 2 ] 1 ' 2 = (a 2 + /i 2 ) 1/2 , 


and simple algebra yields 
E[(j/i — Xj) 4 ] 1/2 = E 


A A g 1/2 N 

— Vi -— -\ - =—Vi 

\ A\ 


1/2 


A^ +3 | +3 | +6 f 


1/2 


□ 


The terms in the theorem above are quite lengthy, but in many situations 
they simplify. For example, we may have the situation where —>• [i and 
a~ —>• a as d —>■ 00 , so U 2 becomes the expected acceptance rate for the 
algorithm. Also, it may be possible to derive a bound such as 

\U 3 \ < C(n + fi) 

so that C /3 is small if both the relative error of the i th eigenvalue and error 
of the means are small. 


5. Examples. 


5.1. Discretized Langevin diffusion - MALA. The proposal for MALA 
is obtained from the Euler-Maruyama discretization of a Langevin diffusion 
process {zt\ which satisfies the stochastic differential equation 


d zt 
d t 


1 


Vk>g7T (Zt) + 


dW t 

dt 


where Wt is standard Brownian motion in M. d . Since the diffusion process 
has the desired target distribution as equilibrium, one might expect a dis¬ 
cretization of the diffusion process to almost preserve the desired target 
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distribution. Given target N(A 1 b,A 1 ), current state x € W l and time 
step h > 0, the MALA proposal y € is defined as 

(11) y = (I - §A)x + \b + y/ht 

where £ ~ iV(0, 1). Identifying this with the AR(1) process (1) and applying 
Corollary 2.2 we have proved the following theorem. 

Theorem 5.1. MALA corresponds to the matrix splitting 

M = l(I-\A), A={I-$A)A, 

N = l(I- f A)(I - |A), 13 = (I-\A)b. 

Thus, MALA corresponds to a matrix splitting where M and N are func¬ 
tions of A and our theory applies. An important feature of MALA is that 
fi = 0 for all i. This greatly simplifies the results in Theorems 3.4 and 
4.2 and we recover [5, Cor. 1] and the simpler results in [22, Thm. 7]. The 
theorem below is a special case of Theorem 5.5 so we omit the proof. 

Theorem 5.2. If there exist positive constants c and C, and k > 0 such 
that the eigenvalues X 2 of A satisfy 

ci K < A i< Ci K 

and if h = l 2 d~ r for some l > 0 and r = | + 2k then MALA satisfies 

E[a(x,i/)]->2$(-^) 

and 

(12) E [Or; - Xif] -> 2 l 2 d~ r <f> (-¥) + °( d " r ) 

as d=r oo where r = Hindoo Ya=i A®. 

Thus, the performance of MALA depends on the choice of h which is 
usually tuned (by tuning l) to maximise the expected jump distance. From 
(12), using s = Zr 1 / 6 /2, we have 

max2 l 2 d~ r & (— l —^-) = max ^ s 2 <P(—s 3 ), 

l> 0 V 8 / s>0 T 1 / 3 

which is maximised at sq = 0.8252, independent of r. Therefore, the accep¬ 
tance rate that maximises expected jump distance is 2<L(—sjj) = 0.574. This 
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result was first stated in [21], then more generally in [22, 5]. In practice, 
h (equivalently l) is adjusted so that the acceptance rate is approximately 
0.574 to maximise the expected jump size. This acceptance rate is indepen¬ 
dent of r, so it is independent of the eigenvalues of A. However, the efficiency 
of MALA still depends on r, and as r increases the expected square jump 
distance will decrease by a factor r 1 / 3 . 

The rationale for studying the case when d —» oo is that it is a good 
approximation for cases when d is ‘large’ and finite (see eg. [21, Fig. 1] or 
[22]). However, the results above suggest that we should take h —>• 0 as 
d —>• oo, to achieve at best an expected jump size that also tends towards 0 
as d —)■ oo. The only good point about these results is that it demonstrates 
superior asymptotic performance of MALA over RWM (see [5, Thms. 1-4 
and Cor. 1] and [21, Fig. 1]). 

To understand the convergence of MALA to equilibrium (burn in) we 
would like to know the “spectral gap” or second largest eigenvalue of the 
transition kernel, as this determines the rate of convergence. As far as we 
are aware this is an open problem. 

We can also use Theorem 5.1 to analyse the unadjusted Langevin algo¬ 
rithm (ULA) in [23], which is simply the MALA proposal chain from (11) 
without the accept/reject step in the MH algorithm. From Theorem 5.1 we 
see that it does not converge to the correct target distribution since A / A. 
Instead of converging to N(A~ 1 b, A^ 1 ), ULA converges to N(A~ 1 b, A~ l ). 
This is the reason why it is usually used as a proposal in the MH algorithm 
(MALA). Indeed, the authors of [23] note that ULA has poor convergence 
properties. We see here that it converges to the wrong target distribution, 
and from [11, 12] we know its convergence rate to this wrong target distri¬ 
bution depends on the spectral radius of G = I — \A, which is close to 1 
when h is small. 

Despite possibly having slow convergence per iteration, both MALA and 
ULA are cheap to compute. Since G = / — |A, we only require a single 
matrix-vector multiplication with A at each iteration and since £ = hi, 
i.i.d. sampling from A r (0, £) at each iteration is cheap. 

5.2. Discretized Langevin diffusion - more general algorithms. It is possi¬ 
ble to generalise MALA by ‘preconditioning’ the Langevin diffusion process 
and using a discretization scheme that is not Euler-Maruyama. For sym¬ 
metric positive definite matrix V € M. dxd (the ‘preconditioner’) consider a 
Langevin process {zt} satisfying the stochastic differential equation 

/ 1q n dz t ( \ , dWt 

(13) — = -FVlog n(z t ) + — 
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9 = 0 

V = I 

MALA and ULA, see e.g. [23]. In [5] the Simplified Langevin 
Algorithm (SLA) is equivalent to MALA for Gaussian target 
distributions. 

9 = 0 

< 

II 

Proposal used in [19]. After a change of variables x «-» 
Q T V~ x i 2 x it is the Preconditioned Simplified Langevin Al¬ 
gorithm (P-SLA) as in [5]. 

9 6 [0,1] 

V = I 

Proposal for the so-called (9-SLA method in [5]. Called 
Crank-Nicolson method in [9] when 9 = 1. 


V = A- 1 

Preconditioned Crank-Nicolson method (pCN), see e.g. [9]. 


Table 1 

Different choices of 9 and V in (14) lead to different proposals for the MH algorithm. 

Other choices are possible. 


where Wt is Brownian motion in with covariance V. For 9 € [0,1], time 
step h > 0 and current state x € define a proposal y € by discretizing 
(13) as 

y — x = §yVlog7T (9y + (1 — 9)x) + Vhu 
where u ~ 1V(0, V). Equivalently, when the target is N(A~ 1 b, A^ 1 ), 


(14) 


y = {I+ o_h VA yi 


{I 


( -^P±VA)x + | Vb + ( hV) 1/2 i 


where £ ~ 1V(0, 1). 

Different choices of 9 and V give different AR(1) process proposals. For 
example, MALA has 9 = 0 and V = I and the preconditioned Crank- 
Nicolson algorithm (pCN) corresponds to 9 = ^ and V = A -1 . Table 1 
describes several more examples. 

Identifying (14) with (1) and applying Corollary 2.2 we can prove the 
following theorem. 


Theorem 5.3. The general Langevin proposal (14) corresponds to the 
matrix splitting 

M = lV~ 1/2 W{I + ^B)V~ 1/2 , A = V- 1/2 WBV~ 1/2 = WA, 

N = \V~ X I 2 W{1 - ( -^^B)V~ 1 / 2 , f3 = V-^ 2 WV^ 2 b = Wb , 

where B = V 1 / 2 AV 1 / 2 , W = I + {9-\)^BandW = I + {9 - ±)§ AV. 

Proof. Take G = (I + e -±VA)~ l (I - VA ), g = \{1 + ^VA)~ 1 Vb 
and E = (/ + A)- 1 (hV)(I + ^AV)^ 1 , then (14) is the same as (1). To 

apply Corollary 2.2 we first check that GE is symmetric. We have 

G = V 1/2 (I+^B)- 1 (I-^^B) V~ 1/2 and E = hV 1/2 {I+^B)~ 2 V 1/2 
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so that 

GE = hV 1/2 {I + - £-gh.B){I + °±B)~ 2 V 1/2 

which is symmetric since V and B are symmetric. Applying Corollary 2.2 
then yields the result. □ 

Therefore, the proposal target for (14) is N(A~ 1 b, (W A)~ l ), and if 9 ^ 
then W A /, A A A, and the target and proposal target disagree. 

If 9 = then the proposal target and target distributions are the same, 
the MH accept/reject step is redundant, and we can use [11, 12, 13] to 
analyse and accelerate the AR(1) process (14). 

To evaluate the performance of the MH algorithm with proposal (14) 
when A A A we would like to be able to apply Theorems 3.4 and 4.2, but 
we see in Theorem 5.3 that the splitting matrices are functions of both B 
and V (not A) so we cannot directly apply our new theory. A change of 
coordinates will fix this! The following lemma is a result of simple algebra 
and Theorem 5.3. 

Lemma 5.4. Under the change of coordinates 

x o V~ l,2 x 

the MH algorithm with target N(A~ 1 b, A^ 1 ) and proposal (14) is trans¬ 
formed to the MH algorithm defined by 

Target: N(B~ 1 V 1 ^ 2 b, R _1 ), 

Proposal: (/ + %B)y = (/ - B)x + \V x ' 2 b + h l/2 £, 

where £ ~ N(0,I) and B = V 1 / 2 AV 1 / 2 . Moreover, the proposal (15) corre¬ 
sponds to the matrix splitting B = M — N 

M = \W{I+^B), A = WB, 

N = lW(I -£=f±B), p = WV 1/2 b , 

where W = I + {6-\)%B. 

Thus, we have transformed the MH algorithm with target N(A~ 1 b, A~ l ) 
and proposal (14) to a MH algorithm where the splitting matrices are func¬ 
tions of the target precision matrix, and we can apply Theorems 3.4 and 4.2 
to (15) to find the expected acceptance rate and expected jump size of the 
MH algorithm with target N(A~ 1 b, A” 1 ) and proposal (14). 
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Theorem 5.5. Suppose there are constants c, C > 0 and k > 0 such 
that the eigenvalues X 2 of VA (equivalently, V 1 / 2 AV 1 / 2 ) satisfy 


ci K < Xi < Ci K for i = 1,..., d. 

If h = l 2 d~ r for r = I + 2 k and l > 0, and r = lim^oo -gpbrr Yli=i then 
a MH algorithm with proposal (14) and target A^A -1 ^, A _1 ) satisfies 


(16) 


E [a(x,y)j -A 2<L - 


l3 \° ~ \\ \/E N 


and for normalised eigenvector qi of V 1 ^ 2 AV 1 ^ 2 corresponding to X 2 , 

(17) E[| qfV-^ix'-x)] 2 } -A 2 l 2 d~ r A> ^ J 3|g ~ 2 ^ +o(d -) 

as d —> 00 . 

The proof of Theorem 5.5 is in the Appendix. 

We see that V plays the same role of preconditioning as for solving a 
linear system and we should choose it minimise the condition number of 
VA, so that r is minimised. 

Although MALA and more general discretizations of Langevin diffusion 
have been successfully analyzed in [5], all of these results are stated for prod¬ 
uct distributions and ‘small’ perturbations of product distributions. Theo¬ 
rem 5.5 extends their theory (in particular [5, Cor. 1]) to the Gaussian case 
with non-diagonal covariance and 6 G [0,1]. See also [22, Thm. 7]. 

For efficiency, as well as considering the expected squared jump size, we 
must also consider the computing cost of the proposal (14), which requires 
the action of (I + A)~ l and an independent sample from 1V(0, V), as 
well as the actions of V and A. 

MALA, with 0 = 0 and V = /, is very cheap to compute because we only 
have to invert the identity matrix and sample from N (0, 1) at each iteration 
(as well as multiply by A). 

Alternatively, pCN, with 6 = ^ and V = A -1 , requires us to sample from 
JV(0, A -1 ) which we have assumed is infeasible because we cannot factorize 

A. 


5.3. Hybrid Monte Carlo. Another type of AR(1) process proposal that 
fits our theory are proposals from the Hybrid (or Hamiltonian) Monte Carlo 
algorithm (HMC), see e.g. [10, 3, 17]. 
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HMC treats the current state x € as the initial position of a parti¬ 
cle, the initial momentum p G of the particle is chosen independently at 
random, and then the motion of the particle is evolved according to a Hamil¬ 
tonian system for a fixed amount of time. The final position of the particle is 
the proposal. Instead of solving the Hamiltonian system exactly, the evolu¬ 
tion of the particle is approximated using a reversible, symplectic numerical 
integrator. For example, the leap-frog method (also called the Stormer-Verlet 
method) is an integrator that preserves a modified Hamiltonian, see e.g. [16]. 
Hence the proposal y, for target N(A~ 1 b, A -1 ), is computed as follows; let 
V € M. dxd be a symmetric positive definite matrix and define a Hamiltonian 
function H : R. d x U. d —>• M by 

(18) H(q,p):=\p T Vp+^q T Aq-b T q. 

Given a time step h > 0, a number of steps L G N, and current state x G M d , 
define qo '■= x, and sample po N (0, V x ). Then for l = 0, ..., L— 1 compute 

Pl+ 1/2 = Pi - | (-4 qi - b ), 

Ql+1 =Ql + hVp l+ 1 / 2 , 

Pi+ 1 = Pi+ 1/2 - - b). 


The proposal is then defined as y := qL- In matrix form we have 


qi+i 

= iv 


+ J 

0 
h 7 

_ pi +i . 


. Pi . 


— b 

L 2 u J 


where K, J G M 2dx2rf are defined as 
K = 

and 

J = 


/ o' 


' I hV ' 


/ 0 ' 


I - ^-VA hV 



0 / 




-hA+^AVA I-^-AV 


' I 

0 ' 

+ 

I 

0 ' 


' / 

hV ' 


2 / 

hV 

0 

I 

_ h A 

L 2 ^ 

/ 


0 

/ 


h a 
l 

21 - ^j-AV 


Hence, y is given by 

(19) V =K 1 

PL 


x 

£ 


L—l 


+ Y J Rlj 


1=0 


0 

-b 
L 2 U 


where £ ~ 1V(0, V) 
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or equivalently, 


( 20 ) 


y = {K L )ux + 



o 

hb 
2 u 


+ (K L ) 12 ti 


where £ ~ IV(0, V -1 ), ( K L )ij is the ij block (of size d x d) of K L , S = 
(/ — K)~ l {I — K l ) and (-)i are the hrst d entries of the vector (•). 

In the case of only one time step of the leap-frog integrator (L = 1) then 
HMC is MALA [3]. Hence, we immediately know that the HMC proposal 
with L = 1 is an AR(1) process where the proposal target and target distri¬ 
butions are not the same, and the expected acceptance rate and jump size 
are given by Theorem 5.2. The case for L > 1 is more complicated, but (20) 
is still an AR(1) process that can be expressed as a matrix splitting using 
(2). The proofs of the following two results are in the Appendix. 


Theorem 5.6. The HMC proposal (20) corresponds to the matrix split¬ 
ting 


M = T~\l + (K L ) 11 ), A 

n = t- 1 (i + (k l ) 11 )(k l ) 11 , p 


X~\I-(K l ) 2 u ), 

S- 1 (/ + (K l ) 11 ) ((SJ 


0 

-b 

2 u 


where E = (K L ) 12 V- 1 (K L )J 2 . 


Corollary 5.7. The matrix splitting from HMC satisfies A 1 f 3 = A 1 b. 


These results imply that the proposal target distribution for HMC is 
N(A~ 1 b,A~ 1 ) rather than the desired target N(A~ 1 b, A^ 1 ), hence why this 
AR(1) process is used as a proposal in the MH algorithm. 

For the analysis of HMC we require the eigenvalues of the iteration matrix. 
A proof of the following result is in the Appendix. 

Theorem 5.8. Let A? be eigenvalues ofVA. Then the iteration matrix 
G = ( K L )n for the HMC proposal has eigenvalues 

Gi = cos (Ldi) 

where 8i = — cos _1 (l — ^-A?). 

From this theorem we see how the eigenvalues of the iteration matrix 
depend on V, the number of time steps L, and the time step h. Again we 
refer to V as a preconditioner (as in [4]) because it plays the same role as a 
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preconditioner for solving systems of linear equations. Alternatively, V may 
be referred to as a mass matrix since p in the Hamiltonian (18) is momentum 
and H is energy. 

To complete our analysis of HMC we restrict our attention to the case 
when d —>• oo and try to apply Theorems 3.4 and 4.2. These theorems require 
that the splitting matrices are functions of the target precision matrix. A 
simple change of coordinates achieves this. 


Theorem 5.9. Under the change of coordinates 

x 1 r V 1 / 2 0 

p ’ ”''* ereV = o V -M2 


X 

p 


^ V 


-l 


r,2dx 2d 


the Hamiltonian (18) and HMC with target N(A 1 b,A x ) and proposal (20) 
are transformed to a Hamiltonian, and HMC defined by 


( 21 ) 


Hamiltonian: H{x,p) := \p T p+ \x T Bx— (V 1 ^ 2 b) T x, 

Target: jV(H _1 U 1/2 6, H" 1 ), 


0 

W l / 2 b 


Proposal: y = (JC L )nx + ( SJ 

\ L 2 ' “ -1/ 1 

where t ~ JV(0, 1), B = V l / 2 AV 1 / 2 , S = (/ - - IC L ), 

1C = 


+ (/C L )l 2 ? 


I-^-B hi 



21 hi 

-hB + ^B 2 I-^-B 

and 

J = 

-\B 21- ^-B 


Moreover, the proposal (21) corresponds to the matrix splitting A = M — N 
M = (/C i )r 2 2 (/+(/C L ) 11 ), A = (K L )fi(I-{lC L ) 2 1 ), 
iY = (/C i )r 2 2 (/+(/C i )n)(/C L ) 11 , /3=(/C i )r 2 2 (/+(/C i ) 11 ) (sj 


0 

L \V x ! 2 b 


Proof. Use K = V/CV" 1 and J = VJV~\ 


l 

□ 


Similar coordinate transformations are used in classical mechanics [1, p. 
103], see also [8]. 

We now have splitting matrices that are functions of the target precision 
matrix, so we can apply Theorems 3.4 and 4.2 to the HMC defined by (21) to 
reveal information about the performance of the original HMC algorithm. To 
avoid being overly technical we restrict ourselves to the case where A* = i K 
for some k > 0. This is still an extension to the results in [4] since they only 
consider the case when k = 0. A proof is in the Appendix. 
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Theorem 5.10. Suppose that for some k > 0, the eigenvalues of VA 
(equivalently, V 1 / 2 AV 1 / 2 ) satisfy 

Xi = i K for i = 1,..., d. 

If h = ld~ r for r = \ + k and l > 0, and L = for fixed T, then the HMC 
algorithm (with proposal (20) and target N(A^ 1 b, A -1 )) satisfies 

( 22 ) E[a( „) M 2 * (- i 7 =r= 

and for eigenvector qi of V 1 / 2 AV 1 / 2 corresponding to X 2 , 

( 23 , (__T_) +o(£r v 2) 

as d —>• oo, where T' = Lh. 

The computational cost of HMC should also be considered. Each proposal 
of HMC requires an independent sample from N(0, V^ 1 ), L matrix-vector 
products with V and L + 1 matrix-vector products with A. Again, there is 
a balance to be struct optimizing the convergence rate relative to compute 
time (rather than iteration count). Extending the results in [3], we have 
shown that in the case when the eigenvalues of VA are i K , T is fixed and d —>■ 
oo, if we take h = ld~ l ^~ K and L = L 77 J then the acceptance rate is 0(1), 
so in high dimensions the number of iterations of HMC per independent 
sample should scale like 0(d 1 ^ +K ), which is an improvement over MALA 
which requires 0{df^ +K ). 

This theory is for the leap-frog numerical integrator applied to the Hamil¬ 
tonian system. Higher order integrators are also suggested in [3] and alterna¬ 
tive numerical integrators based on splitting methods (in the ODEs context) 
are suggested in [ 8 ] that minimize the Hamiltonian error after L steps of the 
integrator. It may be possible to evaluate these other methods using Theo¬ 
rems 3.4 and 4.2 after a change of variables. 

We also note that the variant of HMC in [4] corresponds to V = A -1 
and the change of variables p •<->• Vp. Since this method requires the spectral 
decomposition of A for computing samples of A^C^Ax 1 ) it is infeasible in 
our context. 

6 . Concluding remarks. Designing proposals for the MH algorithm 
to achieve efficient MCMC methods is a challenge, particularly for non- 
Gaussian target distributions, and the job is made harder by the difficultly 
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we have in analysing the convergence properties of MH algorithms. By fo¬ 
cusing on AR(1) process proposals in high dimension we have proven new 
theoretical results that provide us with criteria for evaluating AR(1) process 
proposals and guide us in constructing proposals for efficient MH algorithms. 

We have shown there is flexibility in how AR(1) processes are expressed, 
and by using the matrix splitting formalism we easily identify the proposal 
target distribution. In particular, we have shown that all convergent AR(1) 
processes can be written in matrix splitting form, so that (1) and (2) are 
interchangeable. These include the proposals for MALA, HMC, and other 
discretized Langevin diffusion and discretized Hamiltonian dynamics pro¬ 
posals. Since all AR(1) processes of the form (2) correspond to fixed-scan 
Gibbs samplers, we conclude that MALA and HMC are fixed-scan Gibbs 
samplers, albeit with a proposal target distribution that is not the desired 
target distribution. 

A special case is when the target distribution is normal (but not the same 
as the proposal target) and the splitting matrices are functions of the preci¬ 
sion matrix for the target distribution. In this case we proved new results for 
the expected acceptance rate and expected squared jump size when d —>• oo. 
We showed how these quantities depend on the eigenvalues of the iteration 
matrix, and the difference between the proposal target distribution and the 
target distribution. 

Although our new results are for the special case case of normal target 
distributions, they keep the essential feature of non-nornral targets, because 
the MH accept/reject step must be used to get convergence to the correct 
target distribution. Our results also provide us with guidance for the case 
when the target distribution is non-nornral, since we can take N(A~ l f3,A) 
to be a local normal approximation to it. Moreover, our assumption that the 
splitting matrices are functions of the target precision matrix A (when the 
target is normal) is natural in high dimensions since it may be infeasible to 
factorize A. 

Designing an efficient MH algorithm with an AR(1) process proposal is of¬ 
ten a balancing act between minimising the integrated autocorrelation time 
(we use maximising expected jump size as a proxy for this) and minimising 
compute time for each iteration of the chain. If the proposal target and tar¬ 
get distributions are Gaussian and identical then it follows from the theory 
in [11, 12, 13] that to construct an efficient AR(1) process we should try to 
satisfy the following conditions: 

1. The spectral radius of G should be as small as possible. 

2. The action of G or M -1 should be cheap to compute. 

3. Independent sampling from N(g, E) or N(M~ 1 b, M~ 1 (M T + N)M~ T ) 
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should be cheap to compute. 

However, if the Gaussian proposal target and target distributions are not 
the same, then our new theory suggests that in addition, we also require: 

4. the difference between the desired target and proposal target distribu¬ 
tions should be as small as possible in the sense that the difference in 
means should be small, and the relative difference in precision matrix 
eigenvalues should be small. 

In particular examples we can quantify these conditions using our the¬ 
ory. For example, for proposals based on discretized generalised Langevin 
diffusion, Theorem 5.5 shows us how the choice of symmetric positive defi¬ 
nite matrix V effects efficiency as it effects squared jump size in four ways. 
Whilst choosing V to maximise the limit in (17) (by minimising r and r) 
we should balance this against the scaling and direction that V induces on 
E[qfV~ 1 ^ 2 (x' — x ) 2 ] through q ,; and I/ -1 / 2 on the left-hand side of (17). 

Another example is pCN which satisfies conditions 1, 2 and 4 above, but 
not condition 3. In particular, G is the diagonal matrix with entries all 
on the diagonal, and A = A and /3 = b so the proposal target and desired 
target are the same. However, each iteration of pCN requires an independent 
sample from A^(0,A _1 ), which may be infeasible in high dimension. In the 
special case when A is diagonal, or a spectral decomposition of A is available, 
then pCN satisfies all of our conditions for an efficient method. 

By applying our new results to existing MCMC methods such as MALA, 
HMC, and other discretized Langevin diffusion and discretized Hamiltonian 
dynamics proposals we extended results already in the literature. In partic¬ 
ular, we extended results for Langevin proposals in [5] to Gaussian targets 
with non-diagonal covariance and 9 £ [0,1]. For HMC, we extended results 
in [4] to the case when k > 0 from k = 0. We have also derived a new 
formula for the eigenvalues of the iteration matrix of the HMC proposal. 

Proposals for MALA and HMC are examples of proposals that are con¬ 
structed by discretizing a stochastic differential equation that preserves the 
target distribution. Our theory allows us to broaden the class of AR(1) 
process proposals for the MH algorithm to more general AR(1) process pro¬ 
posals. 

Whilst we do not specify any new acceleration strategies, our results are 
an important step in this direction because we give a criteria to evaluate 
AR(1) process proposals, including accelerations used in Fox and Parker 
[11, 12, 13]. Acceleration techniques for MH algorithms is an avenue for 
further research. 
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APPENDIX A: PROOFS 
A.l. Proof of Lemma 3.1. First note that 

q{x, y) oc exp(^(My — Nx — f3) T (M + N)~ 1 {My — Nx — /3)). 
Simple algebra then yields 



- (.Mx — Ny — (3) t (M + N)- l (Mx - Ny - p) 

+ {My - Nx - p) t {M + N)~ 1 {My - Nx - (3) 

= —y T Ay + x T Ax + 26 T \y — x) 

- ((M - N){x + y)) T {M + N)-\{M + N){x - y)) 
+ 2/3 t (M + N)-\(M + N)(x- y)) 

= -y T (A - A)y + x T {A - A)x + 2(6 - /3) T (y - x). 


A.2. Proof of Lemma 4.1. Suppose lim^ 0O (X]f = i \U\ r )/{Yli=i 


= 0. Then for any e > 0 there exists a D € N such that for any d > D, 


Ya=i \ti\ r < e iYli=i Then for any k € N, taking e = 2 r / 2 , there 


exists a D > k such that for any d > D 



Therefore, for any d > D, t\ < \ Yli=i an d so 



A.3. Proof of Theorem 5.5. We use the following technical lemma 
in the proof of Theorem 5.5. 

Lemma A.l. Suppose {ti} is a sequence of real numbers such that 0 < 


U < Cd V 3 (1) 2k f or C > 0 and n > 0. If s > 3, then Hindoo Xa=i = 0- 


Proof. 


d 


d 


L 



lim V t\ < C s lim d 1_s/3 V I (i) 2KS = C s lim d 1_s/3 


z 2KS dz = 0. 


2=1 


2=1 


□ 
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Proof of Theorem 5.5. First note that VA and V 1 ^ 2 AV 1 ^ 2 are similar, so 
they have the same eigenvalues. Lemma 5.4 implies that it is equivalent to 
study the MH algorithm with target and proposal given by (15), and since 
the splitting matrices for (15) are functions of B = V 1 ^ 2 AV 1 ^ 2 we can apply 
Theorems 3.4 and 4.2 to (15). If we let U = h\ 2 and p = (8 — \)/2 we have 


k — (i + pu) at 

O = ~PU, 
so that 


Gi = 1 - 


Lf 

2 


l+. 

2 H 


i + lu' 3i i + W 


2 (/ * 


1+0 = 


1 + pti 


fi = 0, 


9i = 


tj( 1 + ptf) 

1 + | ti 


To i — Tu — T 2 i — 0, 
+ pti) 


T :il = 


(1 + | U) 2 


T\_i — 


hpt 2 i 


(i + luy 


T§i — 


X' a - 

(i + f+ 2 


To apply Theorems 3.4 and 4.2 we first need to check (8). For sufficiently 
large d, we have 1 < 1 + | and 1 < 1 + pti < |. Then for some 5 > 0 

we have 

I TV 1 2+5 f i+25 

lim ^ i=l1 3l] _ < lim (3\6+35 2^i=ih 

d^o f^ d lm in \ !++ 2 - [2 ' 


(Stiir 3 .l 2 ) 


(T d . (A 1+l!/2 


< 


d—> oo 


— 1 ^^8^+45/^ 
1+5/2 


\^Z^2=1 J 

lim (|) 6+M ( g)8+4g d -^/2 

^°° c (Et+- 1 (^+ 

hm (3\ 6 + 3 +C)8+43 d -3/2 fp Z 8K+4dK dz 
,™J 2 / id a , xi+a/2 u - 

(/o^z) 


Similarly, we can check that (8) is satisfied for j = 4, 5. Now we can apply 
Theorem 3.4, with 


T = 


lim V pi = lim T+ + Z+ 

d —>OO ‘ /7 — S -J 


i= 1 


d—>oo 4 


2=1 


= hm — - 


E 


d->oo " “j'' (1 + |tj) 2 d-s-oo 


= lim — +- 


i G {d-\) 2 T 


2=1 
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and using Lemma A.l, 
d 


a 2 = 


lim \^ a 2 = lim 2T 3 2 + 2T 4 2 + T 5i 


2=1 

d 


d —^OO 


2=1 


, lim X V u (^4 u+ p *o 2 + +Ip 2 ^ 1 - V^) 2 ) 

d ^°° i=1 (! + 3*0 

, lim X/ (sP 2 *^ 1 + p*0 2 + hp 2t t + iP 2t f( l - ^t-U) 2 ) 

a—too 


2=1 

d 


lim X I A 3 

d—t oo z ' 


2=1 


= lim 

d—> oo 


J 6 (*-*) 2 r 


It follows that L = — cr — L = _£ 3 |# — I|^/7/4 and n + = 0. Hence 

obtain (16). 

For the expected jump size, first note that 


we 


U\ = % + i- = 


h 


1 + 


If. 

4 H 


= l 2 d~ r + o(d -1//3-r ). 


(1 + fit) \ (1 + 1^0 

Also, it is easy to show that /ij = o(d _1 ) and a 2 = o(d -1 ) so 

/ 3 |P - \\y/r 


U 2 E[a(x,y)] = 2$ — 


as d —>• 00 , 


and 


IKjI < (»? + /*?) 1/2 
= K 2 + k ?) 1/2 


Jt , O Si ,*$9* 

~i + V yF 


1/2 


\/3 /i 

4 (1 + fiO 


4 + 


1 + f U 


Therefore, applying Theorem 4.2 with the coordinate transformation x -H> 
V~ 1//2 x we obtain (17). 
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A.4. Proof of Theorem 5.6. The result will follow from Corollary 2.2 
with G = (K L ) ii and Y = (K l )i 2 V~ 1 (K L )J 2 but we must first check that 
GY, is symmetric. Define V, /C and B as in Theorem 5.9. Then K = V/CV -1 , 
so that K l = V7C L V -1 and 

(. K L ) n = V 1 / 2 (JC L )uV - 1/2 and ( K L ) l2 = V l/2 {K L )i 2 V l/2 . 


Then 


GY = V 1/2 {JC L )n(JC L ) 2 12 V 1/2 

which is symmetric because V and B are symmetric and (IC L ) n and (IC L ) 12 
are polynomials of B. 


A. 5. Proof of Corollary 5.7. First note that 


(/-(A L ) n )A^= ( SJ 
so we are required to show that 

(I-(K L ) 11 )A~ 1 b= ( SJ 

which holds if 


0 

-b , 

L 2 U J /1 


-b 

L 2 U 


(.I - K 1 


A~ l b 

0 


= SJ 


0 

-b 

L 2 ' J 


Using S = (/ — K) 1 (I — K L ), we can equivalently show 


(I-K) 


A~ l b 

0 


= J 


0 

-b 

L 2 U 


which, is easy to check. 


A.6. Proof of Theorem 5.8. Define a spectral decomposition 
(24) V 1/2 AV 1/ 2 = QAQ t 

where Q is an orthogonal matrix and A = diag(A^f,..., A^) is a diagonal 
matrix of eigenvalues of P 1 / 2 AV 1 ^ 2 (VA is similar to V 1 / 2 AV 1 / 2 so they 
have the same eigenvalues). Also define V as in Theorem 5.9 and 


Q 


Q 0 
0 Q 


<E R 


2dx2d 
















EFFICIENCY OF MCMC 


29 


A similarity transform of K is defined by 


K = VQKQ t V ~ 1 with K = 


/ - A hi 
-hA + ^A 2 I-% A 


Hence K and K have the same eigenvalues. Moreover, K L = VQK L Q T V 1 
and it follows that 


{K L )n = V 1/2 Q(K L ) 11 Q T V~ 1/2 . 

Thus (K L ) 11 and ( K L )u are similar. 

Notice that K is a 2 x 2 block matrix where each d x d block is diagonal. 
Therefore, K L is also a 2x 2 block matrix with diagonal blocks. In particular, 
(I< L ) ii is a diagonal matrix, so the eigenvalues of ( K L )u are on the diagonal 
of (K L ) ii. Moreover, 

l(k L ) n]« = (kf) ii 

where [{K L )n]n is the i th diagonal entry of (. K L )u , (kh )n is the (1,1) entry 
of the matrix k- J € M 2x2 , and ki € M 2x2 is defined by 


' (kn)a 

(k 12 )u ' 


1 - ¥a 2 h 

_ (K 21 )» 

(k 22 )n 


-h\ 2 + ^A^ 1 — -^-A 2 


The matrix ki can be factorized 


1 0 ' 


cos (Oi) 

- sin (Oi) 


' 1 O' 

0 a 


sin (Oi) 

cos (Oi) 


0 a- 1 


where a 


Ay/l-^A 2 and 0 t 


cos(l — ^-A 2 ). Therefore, 



1 0 ' 

cos (LOi) 

— sin (LOi) 


' 1 O' 

0 a 

sin (LOi) 

cos (LOi) 


0 a” 1 


and hence 

[(k L ) n]jj = (ki) 11 = cos (LOi). 


A.7. Proof of Theorem 5.10. 

Proof. Theorem 5.9 implies that it is equivalent to study the MH algo¬ 
rithm with target and proposal given by (21), and since (/C L )n and (IC L ) 12 
are functions of B we can apply Theorems 3.4 and 4.2. Using the spectral 
decomposition (24) note that 

(fc L )ij = Q(k L )ijQ T for i,j = 1,2. 
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where K is defined in the proof of Theorem 5.8, and where it is shown that 
{K L )ij is diagonal and 

[(K L )u]n = cos (LOi) 
where 0i = —cos _1 (l — 4^-A 2 ). Similarly, 

\(K L )i 2 ]u = —a” 1 sin(L0j) 

where a* = 1 — 4pAf. Moreover, A = Q(K L )\ 2 (I — ( K L )\ l )Q T so that 

A 2 = —a? and if we let ti = h? A?, then 

A 2 = A 2 (l Gi = cos(L6»j), g t = 1 - cos (Lf9, : ), ^ = sin 2 (L6' i ), 

n = jti, 1 + r* = —t—, fi = 0. 

1— 

Note that we used Corollary 5.7 to show Vi = 0. Then 


To,; — Ti, — T 2i — 0 , 
T 3i = + sin 2 (L<9,), 


Ty = — 


hi sin 2 (LOi) 


1 - h 


T& = — 


hi sin(2 LOi) 




i - hi 


The trigonmetric expansion cos X (1 — z) = y/2.~z + 0(z 2 ’/ 2 ), ti = l 2 d 1 / 2 (1 ) 2k 
= o(d~ 1 / 2 ) and defining T' such that L = implies there exists a function 
T"(d ) such that 

= Aj(T' + o^- 1 / 2 )) = {^) K (T'd n + o{d K ~h 2 ) = : (i) K T"(d), 
and T"(d) —>• oo as d —>• oo. 

To apply Theorems 3.4 and 4.2 we need to check (8). For some h > 0 we 
find 

Km H-Vy* _ _ lim Eti It.sin^woi^ 

d 5 >oo / 

l 2^=1 H3i 


2V12 


1+<5 / 2 d ~^°° fs^d u . T P.W2 


1+5/2 


_ d ~ s/2 Eli rf ~ 1 |(g) 2K sin 2 ((A) K r"(rf))| 2+<5 
d^oo j_l|/i\ 2 K„;„ 2 //'i\ K ™/'j\\| 2 > \ 1+5/2 


= lim 


= lim d 

d—>oo 


= lim d 

g£—>■ oo 


(E“=+- 1 l(^ 2 K sin 2 ((A)«r"(d))| 2 j 
-5/2 fo \ z2K sin 2 (z K T")j 2+6 dz 
(fo \ z2k Sd h 2 { zK T")\ 2 dz^ 

5!2 Ip 1 \z 2K \ 2+5 dz 


v 2 ac 


| 2 dz^ 


1+5/2 


= 0. 
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Similar arguments verify (8) for T 4i and T 5i . Now we can apply Theorem 3.4 
with 


d 

H = lim V T 3i + T 4i 

d^oo z ' 
i—1 


d ->oo 32 1 — ~ti 

2=1 4 1 


d—>-00 32 
l 4 


i=l 

d 


Mm -- 

1=1 

1 4 /•! 

lim-/ z 4 K sin 2 (z K T"(d))dz 

£—^cxd 32 q 

/*27T /*1 

/ sin 2 (z / )dz / / / K dz 
Jo Jo 


32 2vr 
Z 4 


64(1 + 4k) 


and similarly, 
d 


a 2 = lim ^2T 2 +2T 2 +T 

d—>-oo z ' 


2=1 

d 


lim Vl < ; s in*(Lft) + 4 t ‘ Sill4 , (M : ) + 1 

/-Inn ^ 


d —^-oo ^^ 32 
2=1 


32 (1 - 1^)2 64 1 - \ tl 


X] sin *(L0i) + ^U 2 sin 2 (2 LOi) 


d -^ f-OG — J 16 
2=1 


+ ^<r 1 (5) 4 *sin a (2(j)“T"(<0) 


d— >oc ' 16 

i= 1 


/ 4 f 1 / 4 Z" 1 

lim - / / K sinVT")dz+- / , 
d^oo 16 Jo 64 Jo 


/ K sin 2 (2z K T ,, )dz 


£ (s f smV)d2 ') 4 (* 1 2 “ di 


Z 4 


32(1 + 4k)' 
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Hence ^ = —a — £ = — and p + a 2 /2 = 0, so from Theorem 3.4 

we obtain (22). 

For the expected jump size, we apply Theorem 4.2 with 

g 2 g t (1-cos {L6i)) 2 sin 2 (L9i) 2(1 - cos(A iT')) 

Ul = y + if = —v——v — + ° (i } 

as d —>• oo. Also, it is straightforward to show that /ij = o(d -1 ) and cr 2 = 
o(d -1 ). Hence 


U 2 —t E[a(x,?/)] = 2<f> - 


/ 2 


8^2 VI + 4k 


as d —>■ oo, and 


|t/ 3 | < (^+M?)' /2 (3^ + 3|| + 6 


X 4 ' ' "A?A 2 ) 




1/2 




= (^ + m 2 ) 1/2 a? 


(l-cos(Le i )) 4 + 3in4 f 9 b +2 (i-^(rfl i )) 2 sin 2 (z,9p 




1/2 


= o(d-^,. 

Therefore, we obtain (23). 


□ 
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